26 research outputs found

    MASSIVELY PARALLEL ALGORITHMS FOR POINT CLOUD BASED OBJECT RECOGNITION ON HETEROGENEOUS ARCHITECTURE

    With the advent of new commodity depth sensors, point cloud data processing plays an increasingly important role in object recognition and perception. However, the computational cost of point cloud data processing is extremely high due to the large data size, high dimensionality, and algorithmic complexity. To address the computational challenges of real-time processing, this work investigates the use of modern heterogeneous computing platforms and their supporting ecosystem, such as massively parallel architecture (MPA), computing clusters, compute unified device architecture (CUDA), and multithreaded programming, to accelerate point cloud based object recognition. These computing platforms do not yield high performance unless their specific features are properly utilized; failing that, the result can actually be inferior performance. To achieve high-speed performance in image descriptor computing, indexing, and matching in point cloud based object recognition, this work explores both coarse and fine grain level parallelism, identifies the acceptable levels of algorithmic approximation, and analyzes various factors that affect performance. A set of heterogeneous parallel algorithms is designed and implemented in this work. These algorithms include exact and approximate scalable massively parallel image descriptors for descriptor computing, parallel construction of the k-dimensional tree (KD-tree) and the forest of KD-trees for descriptor indexing, and parallel approximate nearest neighbor search (ANNS) and buffered ANNS (BANNS) on the KD-tree and the forest of KD-trees for descriptor matching. The results show that the proposed massively parallel algorithms on heterogeneous computing platforms can significantly improve the execution time of feature computing, indexing, and matching. Meanwhile, this work demonstrates that heterogeneous computing architectures, with appropriate architecture-specific algorithm design and optimization, have distinct advantages in improving the performance of multimedia applications.

    A 3D local descriptor SHOT on massively parallel processors

    This paper investigates the suitability of the graphics processing unit (GPU) for a novel 3D object local descriptor named signature of histograms of orientations (SHOT). SHOT offers a good balance between descriptiveness and robustness, but its high complexity incurs a heavy computational workload. We designed two parallel SHOTs on GPU to speed up the original serial counterpart. Experimental results show that both the exact and approximate parallel algorithms exhibit outstanding speed performance. Moreover, we validated through quantitative experimental comparisons that the descriptiveness of both parallel SHOTs remains at a high level.

    Massive parallelization of approximate nearest neighbor search on KD-tree for high-dimensional image descriptor matching

    To overcome the high computing cost associated with high-dimensional digital image descriptor matching, this paper presents a massively parallel approximate nearest neighbor search (ANNS) on the K-dimensional tree (KD-tree) for modern massively parallel architectures (MPA). The proposed algorithm is of comparable quality to its traditional sequential counterpart on the central processing unit (CPU), yet it achieves a speedup factor of 121 when applied to high-dimensional real-world image descriptor datasets. The factors that affect the algorithm's performance are also studied to obtain the optimal runtime configurations for various datasets. The performance of the proposed parallel ANNS algorithm is also verified on typical 3D image matching scenarios. With the classical local image descriptor signature of histograms of orientations (SHOT), the parallel image descriptor matching can achieve a speedup of up to 128. Our implementation will potentially benefit realtime image descriptor matching in high dimensions.
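As a point of reference for what an approximate search on a KD-tree can look like, the following is a minimal sequential sketch in Python. It is not the paper's GPU algorithm: it assumes a generic best-bin-first style search that stops after a fixed budget of node visits, and the names `build`, `approx_nearest`, and `max_checks` are hypothetical, chosen only for illustration.

```python
import heapq
from itertools import count


def build(points, depth=0):
    """Return a KD-tree as nested tuples (point, axis, left, right), or None."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                    # median point becomes the node
    return (pts[mid], axis,
            build(pts[:mid], depth + 1),
            build(pts[mid + 1:], depth + 1))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def approx_nearest(root, query, max_checks):
    """Search that stops after examining at most max_checks tree nodes.

    Subtrees are visited in order of a lower bound on their distance to the
    query, so the most promising regions are examined first.
    """
    best, best_d = None, float("inf")
    tie = count()                          # tie-breaker: the heap never compares nodes
    heap = [(0.0, next(tie), root)]        # (lower bound, tie, subtree)
    checks = 0
    while heap and checks < max_checks:
        bound, _, node = heapq.heappop(heap)
        if node is None or bound >= best_d:
            continue                       # empty subtree, or it cannot improve the result
        point, axis, left, right = node
        checks += 1
        d = sq_dist(point, query)
        if d < best_d:
            best, best_d = point, d
        diff = query[axis] - point[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        heapq.heappush(heap, (bound, next(tie), near))
        # Every point on the far side is at least |diff| away along this axis.
        heapq.heappush(heap, (max(bound, diff ** 2), next(tie), far))
    return best
```

A small `max_checks` trades matching quality for speed; once the budget is at least the number of points, this sketch returns an exact nearest neighbor.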

    G-SHOT: GPU accelerated 3D local descriptor for surface matching

    Signature of histograms of orientations (SHOT), a novel 3D object local descriptor, achieves a good balance between descriptiveness and robustness in surface matching. However, its computational workload is much higher than that of other 3D local descriptors. This paper investigates the development of suitable massively parallel algorithms on the graphics processing unit (GPU) for the computation of high-density and large-scale 3D object local descriptors through two alternative parallel algorithms: one exact and one approximate. Both algorithms exhibit outstanding speedup performance. The exact parallel descriptor comes at no cost to descriptiveness, with a speedup factor of up to 40.70 with respect to the serial SHOT on the central processing unit (CPU). The approximate version achieves a corresponding speedup factor of up to 54 with minor degradation in descriptiveness. The proposed algorithms are integrated into the Point Cloud Library (PCL), an open source project for image and point cloud

    High-dimensional image descriptor matching using highly parallel KD-tree construction and approximate nearest neighbor search

    To overcome the high computational cost associated with high-dimensional digital image descriptor matching, this paper presents a set of integrated parallel algorithms for the construction of the K-dimensional tree (KD-tree) and P approximate nearest neighbor search (P-ANNS) on modern massively parallel architectures (MPA). To improve the runtime performance of the P-ANNS, we propose an efficient sliding window for a parallel buffered P-ANNS on the KD-tree to mitigate the high cost of global memory accesses. When applied to high-dimensional real-world image descriptor datasets, the proposed KD-tree construction and buffered P-ANNS algorithms are of comparable matching quality to their traditional sequential counterparts on CPU, while outperforming them by speedup factors of up to 17 and 163, respectively. The factors that affect the algorithms' performance are also studied to obtain the optimal runtime configurations for various datasets. Moreover, we verify the features of the parallel algorithms on typical 3D image matching scenarios. With the classical local image descriptor signature of histograms of orientations (SHOT) datasets, the parallel KD-tree construction and image descriptor matching can achieve speedups of up to 11-fold and 138-fold, respectively.

    Parallel randomized KD-tree forest on GPU cluster for image descriptor matching

    Many high-dimensional data mining applications involve nearest neighbor search (NNS) on a KD-tree. A randomized KD-tree forest enables fast medium- and large-scale NNS among high-dimensional data points. In this paper, we present massively parallel algorithms for the construction of a KD-tree forest and for NNS on a cluster equipped with massively parallel architecture (MPA) devices, namely graphics processing units (GPUs). This design can significantly accelerate KD-tree forest construction and NNS for signature of histograms of orientations (SHOT) 3D local descriptors, by factors of up to 5.27 and 20.44, respectively. Our implementations will potentially benefit realtime high-dimensional descriptor matching.
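To illustrate the general idea of a randomized KD-tree forest, here is a minimal single-machine sketch in Python. It is not the paper's cluster or GPU implementation, and it makes two assumptions that are not taken from the abstract: each tree picks its split axis at random among the few highest-variance dimensions, and a query does one greedy root-to-leaf descent per tree. The names `build_random_tree`, `descend`, `forest_nearest`, and `top_dims` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def build_random_tree(points, rng, top_dims=2):
    """Build one randomized KD-tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    dims = range(len(points[0]))
    # Rank dimensions by variance and pick one of the top few at random,
    # so that different trees partition the space differently.
    ranked = sorted(dims, key=lambda d: variance([p[d] for p in points]), reverse=True)
    axis = rng.choice(ranked[:top_dims])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (pts[mid], axis,
            build_random_tree(pts[:mid], rng, top_dims),
            build_random_tree(pts[mid + 1:], rng, top_dims))


def descend(node, query):
    """Greedy descent: best (point, squared distance) seen on one root-to-leaf path."""
    best, best_d = None, float("inf")
    while node is not None:
        point, axis, left, right = node
        d = sq_dist(point, query)
        if d < best_d:
            best, best_d = point, d
        node = left if query[axis] < point[axis] else right
    return best, best_d


def forest_nearest(trees, query):
    """Approximate nearest neighbor: the best candidate over all trees."""
    return min((descend(t, query) for t in trees), key=lambda r: r[1])[0]


# Usage: build a small forest with a seeded generator for reproducibility.
rng = random.Random(0)
data = [((i * 37) % 101 / 101, (i * 53) % 103 / 103) for i in range(50)]
forest = [build_random_tree(data, rng) for _ in range(4)]
candidate = forest_nearest(forest, (0.5, 0.5))
```

Because each tree is independent, both construction and querying can be distributed across devices, which is the property a parallel forest design can exploit.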

    Implementation and evaluation of Raptor code on GPU

    Raptor code, a member of the fountain code family, is a significant theoretical improvement over the Luby transform code (LT code) for forward error correction (FEC) transmission. Graphics processing units (GPUs) have become commonplace in the consumer market and are finding their way beyond graphics processing into general purpose computing. This paper investigates the suitability of the GPU for Raptor code to process large block and symbol sizes in FEC transmission. The serial and parallel implementations of Raptor code are explored on CPU and GPU, respectively. Our work shows that efficient parallelization on the GPU can improve the performance of the decoder significantly, by a factor of up to 46. Furthermore, to understand the performance bottlenecks of Raptor code on both the GPU and CPU platforms, the decoding speed is evaluated for different block and symbol sizes. © 2012 IEEE

    Massively parallel KD-tree construction and nearest neighbor search algorithms

    This paper presents parallel algorithms for the construction of the k-dimensional tree (KD-tree) and nearest neighbor search (NNS) on the massively parallel architecture (MPA) of the graphics processing unit (GPU). Unlike previous parallel algorithms for the KD-tree, our parallel algorithms integrate, for the first time, high-dimensional KD-tree construction and NNS on an MPA platform. The proposed massively parallel algorithms are of comparable quality to their traditional sequential counterparts on CPU, while achieving high speedups on both low- and high-dimensional KD-trees. The low-dimensional KD-tree construction and NNS algorithms presented in this paper outperform their serial CPU counterparts by factors of up to 24 and 218, respectively. For high-dimensional KD-trees, the speedup factors are even higher, rising to 30 and 242, respectively. Our implementations will potentially benefit real-time three-dimensional (3D) image registration and high-dimensional descriptor matching.
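For readers unfamiliar with the two operations named above, the following is a minimal sketch in Python of sequential KD-tree construction and exact nearest neighbor search, that is, the kind of serial CPU baseline such speedups are measured against. It is a textbook formulation under stated assumptions (median split, axes cycled by depth), not the authors' code; `Node`, `build_kdtree`, and `nearest` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point
    axis: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_kdtree(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Recursively split the points at the median along a cycling axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build_kdtree(ordered[:mid], depth + 1),
                build_kdtree(ordered[mid + 1:], depth + 1))


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node: Optional[Node], query: Point,
            best: Optional[Point] = None) -> Optional[Point]:
    """Exact nearest neighbor search with backtracking."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, query) < sq_dist(best, query):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the
    # best match found so far; otherwise it cannot hold a closer point.
    if diff ** 2 < sq_dist(best, query):
        best = nearest(far, query, best)
    return best


# Usage: index six 2D points and query one location.
tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
match = nearest(tree, (9, 2))
```

The recursive sort at every level keeps the sketch short; it costs more than a linear-time median selection would, which is acceptable for illustration only.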

    Forward error correction with RaptorQ code on GPU

    RaptorQ code, the next generation of Raptor code for forward error correction (FEC), is proposed to significantly reduce the redundant information. However, the improved coding performance comes at the expense of increased encoding and decoding complexity. Meanwhile, graphics processing units (GPUs) in the consumer market are finding their way beyond graphics processing into general purpose computing. This paper investigates the suitability of the GPU for RaptorQ code to process large block and symbol sizes in FEC transmission. The paper explores serial and parallel implementations of Raptor code on CPU and GPU, respectively. Our work shows that efficient parallelization on the GPU can improve the performance of the decoder significantly. Furthermore, simulations are performed for the practical real-time requirements of multimedia broadcast/multicast service (MBMS) and digital video broadcasting (DVB) in high-speed downlink packet access (HSDPA) networks. Conclusions are drawn with respect to the applicability of this new code for realtime multimedia broadcasting and content delivery on GPU. © 2013 IEEE

    A Chinese character teaching system using structure theory and morphing technology.

    This paper proposes a Chinese character teaching system that uses Chinese character structure theory and 2D contour morphing technology. The system, which comprises an offline phase and an online phase, automatically generates animations of the same Chinese character across different writing stages to show intuitively how shape and topology evolve, for use in teaching Chinese characters. The offline phase builds the component model database for a single script and the component correspondence database across different scripts. Given two or more different scripts of the same Chinese character, the online phase first divides the characters into components through Chinese character parsing, and then generates the evolution animation through Chinese character morphing. Finally, two writing stages of Chinese characters, i.e., seal script and clerical script, are used in an experiment to show the ability of the system. The results of the user experience study show that the system can successfully guide students to improve their learning of Chinese characters. Users also agree that the system is interesting and can motivate them to learn.